Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Topic:Food Recognition

What is Food Recognition? Food recognition is the process of identifying and categorizing different types of food in images or videos.

An Explorative Analysis of SVM Classifier and ResNet50 Architecture on African Food Classification

May 20, 2025

Chinedu Emmanuel Mbonu, Kenechukwu Anigbogu, Doris Asogwa, Tochukwu Belonwu

Abstract:Food recognition systems has advanced significantly for Western cuisines, yet its application to African foods remains underexplored. This study addresses this gap by evaluating both deep learning and traditional machine learning methods for African food classification. We compared the performance of a fine-tuned ResNet50 model with a Support Vector Machine (SVM) classifier. The dataset comprises 1,658 images across six selected food categories that are known in Africa. To assess model effectiveness, we utilize five key evaluation metrics: Confusion matrix, F1-score, accuracy, recall and precision. Our findings offer valuable insights into the strengths and limitations of both approaches, contributing to the advancement of food recognition for African cuisines.

* 7 pages, 9 figures

Via

Access Paper or Ask Questions

Advancing Food Nutrition Estimation via Visual-Ingredient Feature Fusion

May 13, 2025

Huiyan Qi, Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Ee-Peng Lim

Abstract:Nutrition estimation is an important component of promoting healthy eating and mitigating diet-related health risks. Despite advances in tasks such as food classification and ingredient recognition, progress in nutrition estimation is limited due to the lack of datasets with nutritional annotations. To address this issue, we introduce FastFood, a dataset with 84,446 images across 908 fast food categories, featuring ingredient and nutritional annotations. In addition, we propose a new model-agnostic Visual-Ingredient Feature Fusion (VIF$^2$) method to enhance nutrition estimation by integrating visual and ingredient features. Ingredient robustness is improved through synonym replacement and resampling strategies during training. The ingredient-aware visual feature fusion module combines ingredient features and visual representation to achieve accurate nutritional prediction. During testing, ingredient predictions are refined using large multimodal models by data augmentation and majority voting. Our experiments on both FastFood and Nutrition5k datasets validate the effectiveness of our proposed method built in different backbones (e.g., Resnet, InceptionV3 and ViT), which demonstrates the importance of ingredient information in nutrition estimation. https://huiyanqi.github.io/fastfood-nutrition-estimation/.

* Accepted for publication in ACM International Conference on Multimedia Retrieval 2025

Via

Access Paper or Ask Questions

FoodTrack: Estimating Handheld Food Portions with Egocentric Video

May 07, 2025

Ervin Wang, Yuhao Chen

Abstract:Accurately tracking food consumption is crucial for nutrition and health monitoring. Traditional approaches typically require specific camera angles, non-occluded images, or rely on gesture recognition to estimate intake, making assumptions about bite size rather than directly measuring food volume. We propose the FoodTrack framework for tracking and measuring the volume of hand-held food items using egocentric video which is robust to hand occlusions and flexible with varying camera and object poses. FoodTrack estimates food volume directly, without relying on intake gestures or fixed assumptions about bite size, offering a more accurate and adaptable solution for tracking food consumption. We achieve absolute percentage loss of approximately 7.01% on a handheld food object, improving upon a previous approach that achieved a 16.40% mean absolute percentage error in its best case, under less flexible conditions.

* Accepted as extended abstract at CVPR 2025 Metafood workshop

Via

Access Paper or Ask Questions

Electronic Nose for Agricultural Grain Pest Detection, Identification, and Monitoring: A Review

May 02, 2025

Chetan M Badgujar, Sai Swaminathan, Alison Gerken

Abstract:Biotic pest attacks and infestations are major causes of stored grain losses, leading to significant food and economic losses. Conventional, manual, sampling-based pest recognition methods are labor-intensive, time-consuming, costly, require expertise, and may not even detect hidden infestations. In recent years, the electronic nose (e-nose) approach has emerged as a potential alternative for agricultural grain pest recognition and monitoring. An e-nose mimics human olfactory systems by integrating a sensor array, data acquisition, and analysis for recognizing grain pests by analyzing volatile organic compounds (VOCs) emitted by grain and pests. However, well-documented, curated, and synthesized literature on the use of e-nose technology for grain pest detection is lacking. Therefore, this systematic literature review provides a comprehensive overview of the current state-of-the-art e-nose technology for agricultural grain pest monitoring. The review examines employed sensor technology, targeted pest species type, grain medium, data processing, and pattern recognition techniques. An e-nose is a promising tool that offers a rapid, low-cost, non-destructive solution for detecting, identifying, and monitoring grain pests, including microscopic and hidden insects, with good accuracy. We identified the factors that influence the e-nose performance, which include pest species, storage duration, temperature, moisture content, and pest density. The major challenges include sensor array optimization or selection, large data processing, poor repeatability, and comparability among measurements. An inexpensive and portable e-nose has the potential to help stakeholders and storage managers take timely and data-driven informed actions or decisions to reduce overall food and economic losses.

Via

Access Paper or Ask Questions

Dietary Intake Estimation via Continuous 3D Reconstruction of Food

May 01, 2025

Wallace Lee, YuHao Chen

Abstract:Monitoring dietary habits is crucial for preventing health risks associated with overeating and undereating, including obesity, diabetes, and cardiovascular diseases. Traditional methods for tracking food intake rely on self-reported data before or after the eating, which are prone to inaccuracies. This study proposes an approach to accurately monitor ingest behaviours by leveraging 3D food models constructed from monocular 2D video. Using COLMAP and pose estimation algorithms, we generate detailed 3D representations of food, allowing us to observe changes in food volume as it is consumed. Experiments with toy models and real food items demonstrate the approach's potential. Meanwhile, we have proposed a new methodology for automated state recognition challenges to accurately detect state changes and maintain model fidelity. The 3D reconstruction approach shows promise in capturing comprehensive dietary behaviour insights, ultimately contributing to the development of automated and accurate dietary monitoring tools.

* 2025 CVPR MetaFood Workshop

Via

Access Paper or Ask Questions

Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition

Apr 09, 2025

Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos Zambrano, Guadalupe X. Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, Aythami Morales

Abstract:Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert-labeled images across 10 categories (e.g., "protein source"), 62 subcategories (e.g., "poultry"), and 9 cooking styles (e.g., "grilled"). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert-Weighted Recall (EWR), that accounts for the inter-annotator variability. Results show that closed-source models outperform open-source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.

* Accepted at IEEE/CVF Computer Vision and Pattern Recognition Conference workshops 2025 (CVPRw) 10 pages, 4 figures, 2 tables

Via

Access Paper or Ask Questions

Benchmarking Post-Hoc Unknown-Category Detection in Food Recognition

Mar 24, 2025

Lubnaa Abdur Rahman, Ioannis Papathanail, Lorenzo Brigato, Stavroula Mougiakakou

Abstract:Food recognition models often struggle to distinguish between seen and unseen samples, frequently misclassifying samples from unseen categories by assigning them an in-distribution (ID) label. This misclassification presents significant challenges when deploying these models in real-world applications, particularly within automatic dietary assessment systems, where incorrect labels can lead to cascading errors throughout the system. Ideally, such models should prompt the user when an unknown sample is encountered, allowing for corrective action. Given no prior research exploring food recognition in real-world settings, in this work we conduct an empirical analysis of various post-hoc out-of-distribution (OOD) detection methods for fine-grained food recognition. Our findings indicate that virtual logit matching (ViM) performed the best overall, likely due to its combination of logits and feature-space representations. Additionally, our work reinforces prior notions in the OOD domain, noting that models with higher ID accuracy performed better across the evaluated OOD detection methods. Furthermore, transformer-based architectures consistently outperformed convolution-based models in detecting OOD samples across various methods.

Via

Access Paper or Ask Questions

Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition

Mar 15, 2025

Shun Zou, Yi Zou, Mingya Zhang, Shipeng Luo, Zhihao Chen, Guangwei Gao

Abstract:In recent years, Transformer has witnessed significant progress in food recognition. However, most existing approaches still face two critical challenges in lightweight food recognition: (1) the quadratic complexity and redundant feature representation from interactions with irrelevant tokens; (2) static feature recognition and single-scale representation, which overlook the unstructured, non-fixed nature of food images and the need for multi-scale features. To address these, we propose an adaptive and efficient sparse Transformer architecture (Fraesormer) with two core designs: Adaptive Top-k Sparse Partial Attention (ATK-SPA) and Hierarchical Scale-Sensitive Feature Gating Network (HSSFGN). ATK-SPA uses a learnable Gated Dynamic Top-K Operator (GDTKO) to retain critical attention scores, filtering low query-key matches that hinder feature aggregation. It also introduces a partial channel mechanism to reduce redundancy and promote expert information flow, enabling local-global collaborative modeling. HSSFGN employs gating mechanism to achieve multi-scale feature representation, enhancing contextual semantic information. Extensive experiments show that Fraesormer outperforms state-of-the-art methods. code is available at https://zs1314.github.io/Fraesormer.

* 6 pages, 4 figures

Via

Access Paper or Ask Questions

GFT: Gradient Focal Transformer

Apr 14, 2025

Boris Kriuk, Simranjit Kaur Gill, Shoaib Aslam, Amir Fakhrutdinov

Abstract:Fine-Grained Image Classification (FGIC) remains a complex task in computer vision, as it requires models to distinguish between categories with subtle localized visual differences. Well-studied CNN-based models, while strong in local feature extraction, often fail to capture the global context required for fine-grained recognition, while more recent ViT-backboned models address FGIC with attention-driven mechanisms but lack the ability to adaptively focus on truly discriminative regions. TransFG and other ViT-based extensions introduced part-aware token selection to enhance attention localization, yet they still struggle with computational efficiency, attention region selection flexibility, and detail-focus narrative in complex environments. This paper introduces GFT (Gradient Focal Transformer), a new ViT-derived framework created for FGIC tasks. GFT integrates the Gradient Attention Learning Alignment (GALA) mechanism to dynamically prioritize class-discriminative features by analyzing attention gradient flow. Coupled with a Progressive Patch Selection (PPS) strategy, the model progressively filters out less informative regions, reducing computational overhead while enhancing sensitivity to fine details. GFT achieves SOTA accuracy on FGVC Aircraft, Food-101, and COCO datasets with 93M parameters, outperforming ViT-based advanced FGIC models in efficiency. By bridging global context and localized detail extraction, GFT sets a new benchmark in fine-grained recognition, offering interpretable solutions for real-world deployment scenarios.

* 11 pages, 3 tables, 5 figures

Via

Access Paper or Ask Questions

Self-Supervised Pretraining for Fine-Grained Plankton Recognition

Mar 14, 2025

Joona Kareinen, Tuomas Eerola, Kaisa Kraft, Lasse Lensu, Sanna Suikkanen, Heikki Kälviäinen

Abstract:Plankton recognition is an important computer vision problem due to plankton's essential role in ocean food webs and carbon capture, highlighting the need for species-level monitoring. However, this task is challenging due to its fine-grained nature and dataset shifts caused by different imaging instruments and varying species distributions. As new plankton image datasets are collected at an increasing pace, there is a need for general plankton recognition models that require minimal expert effort for data labeling. In this work, we study large-scale self-supervised pretraining for fine-grained plankton recognition. We first employ masked autoencoding and a large volume of diverse plankton image data to pretrain a general-purpose plankton image encoder. Then we utilize fine-tuning to obtain accurate plankton recognition models for new datasets with a very limited number of labeled training images. Our experiments show that self-supervised pretraining with diverse plankton data clearly increases plankton recognition accuracy compared to standard ImageNet pretraining when the amount of training data is limited. Moreover, the accuracy can be further improved when unlabeled target data is available and utilized during the pretraining.

Via

Access Paper or Ask Questions

Topic:Food Recognition

Papers and Code